Weakly-supervised object localization aims to indicate the category as well as the scope of an object in an image given only the image-level labels. Most of the existing works are based on Class Activation Mapping (CAM) and endeavor to enlarge the discriminative area inside the activation map to perceive the whole object, yet ignore the co-occurrence confounder of the object and context (e.g., fish and water), which makes it hard for the model to distinguish object boundaries. Besides, the use of CAM also brings a dilemma: classification and localization always suffer from a performance gap and cannot reach their highest accuracies simultaneously. In this paper, we propose a causal knowledge distillation method, dubbed KD-CI-CAM, to address these two under-explored issues in one go. More specifically, we tackle the co-occurrence context confounder problem via causal intervention (CI), which explores the causalities among image features, contexts, and categories to eliminate the biased object-context entanglement in the class activation maps. Based on the de-biased object feature, we additionally propose a multi-teacher causal distillation framework to balance the absorption of classification knowledge and localization knowledge during model training. Extensive experiments on several benchmarks demonstrate the effectiveness of KD-CI-CAM in learning clear object boundaries from confounding contexts and addressing the dilemma between classification and localization performance.
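As a point of reference for the abstract above, the standard CAM recipe it builds on can be sketched in a few lines; shapes and variable names here are illustrative, not the paper's implementation.

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """Weight the final conv feature maps by one class's classifier
    weights and sum over channels (the standard CAM recipe)."""
    # features: (C, H, W) conv features; fc_weights: (num_classes, C)
    cam = np.tensordot(fc_weights[class_idx], features, axes=([0], [0]))  # (H, W)
    cam = np.maximum(cam, 0.0)          # keep only positive class evidence
    if cam.max() > 0:
        cam = cam / cam.max()           # normalize to [0, 1] for visualization
    return cam

rng = np.random.default_rng(0)
cam = class_activation_map(rng.normal(size=(8, 7, 7)),
                           rng.normal(size=(10, 8)), class_idx=3)
print(cam.shape)  # (7, 7)
```

The paper's contribution then operates on top of maps like this one, de-biasing the object-context entanglement rather than changing how the map itself is computed.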
translated by 谷歌翻译
Learning continuous image representations has recently gained popularity for image super-resolution (SR) because of its ability to reconstruct high-resolution images with arbitrary scales from low-resolution inputs. Existing methods mostly ensemble nearby features to predict the new pixel at any queried coordinate in the SR image. Such a local ensemble suffers from several limitations: i) it has no learnable parameters and neglects the similarity of the visual features; ii) it has a limited receptive field and cannot ensemble relevant features over the larger field that matters in an image; iii) it inherently has a gap with real camera imaging since it depends only on the coordinate. To address these issues, this paper proposes a continuous implicit attention-in-attention network, called CiaoSR. We explicitly design an implicit attention network to learn the ensemble weights for the nearby local features. Furthermore, we embed a scale-aware attention in this implicit attention network to exploit additional non-local information. Extensive experiments on benchmark datasets demonstrate that CiaoSR significantly outperforms the existing single-image super-resolution (SISR) methods with the same backbone. In addition, the proposed method also achieves state-of-the-art performance on the arbitrary-scale SR task. The effectiveness of the method is also demonstrated in the real-world SR setting. More importantly, CiaoSR can be flexibly integrated into any backbone to improve the SR performance.
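The core idea of learning the ensemble weights with attention, instead of fixed coordinate-based interpolation, can be sketched as follows; the projections `w_q`/`w_k` stand in for learned parameters and everything here is an illustrative assumption, not the CiaoSR architecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def implicit_attention_ensemble(query_xy, neighbor_xy, neighbor_feats, w_q, w_k):
    """Predict the feature at an arbitrary coordinate by attending over
    nearby latent features, so the ensemble weights depend on both the
    relative coordinates and the visual features (unlike plain
    area-weighted interpolation)."""
    rel = query_xy - neighbor_xy                       # relative offsets, one per neighbor
    q = rel @ w_q.T                                    # queries from coordinates
    k = neighbor_feats @ w_k.T                         # keys from visual features
    scores = (q * k).sum(axis=1) / np.sqrt(q.shape[1])  # scaled dot-product scores
    attn = softmax(scores)                             # learned ensemble weights, sum to 1
    return attn @ neighbor_feats                       # weighted feature combination

rng = np.random.default_rng(1)
feat = implicit_attention_ensemble(
    query_xy=np.array([0.3, 0.7]),
    neighbor_xy=rng.uniform(size=(4, 2)),   # 4 nearest latent coordinates
    neighbor_feats=rng.normal(size=(4, 16)),
    w_q=rng.normal(size=(8, 2)),            # stand-ins for learned projections
    w_k=rng.normal(size=(8, 16)))
print(feat.shape)  # (16,)
```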
Reference-based image super-resolution (RefSR) aims to exploit an auxiliary reference (Ref) image to super-resolve a low-resolution (LR) image. Recently, RefSR has attracted great attention as it provides an alternative way to surpass single-image SR. However, there are two critical challenges in solving the RefSR problem: (i) it is difficult to match the correspondences between the LR and Ref images when they differ significantly; (ii) how to transfer the relevant textures from the Ref image to compensate for the details of the LR image is very challenging. To address these issues, this paper proposes a deformable attention transformer, namely DATSR, with multiple scales, where each scale consists of a texture feature encoder (TFE) module, a reference-based deformable attention (RDA) module, and a residual feature aggregation (RFA) module. Specifically, TFE first extracts image transformation (e.g., brightness) insensitive features from the LR and Ref images, RDA can exploit multiple relevant textures to compensate for more information of the LR features, and RFA finally aggregates the LR features and the relevant textures to obtain more visually pleasant results. Extensive experiments demonstrate that our DATSR achieves state-of-the-art performance on benchmark datasets both quantitatively and qualitatively.
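The correspondence-matching step that any RefSR pipeline needs before transferring textures can be sketched with plain cosine similarity over patch features; this is a generic illustration, not the DATSR matching module.

```python
import numpy as np

def match_reference_patches(lr_feats, ref_feats):
    """For each LR patch feature, find the most similar reference patch
    by cosine similarity -- the matching that precedes texture transfer."""
    lr_n = lr_feats / np.linalg.norm(lr_feats, axis=1, keepdims=True)
    ref_n = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    sim = lr_n @ ref_n.T               # (N_lr, N_ref) similarity matrix
    idx = sim.argmax(axis=1)           # best-matching Ref patch per LR patch
    return idx, sim.max(axis=1)        # indices and matching confidence

rng = np.random.default_rng(5)
ref = rng.normal(size=(50, 16))
lr = ref[[3, 7, 9]] + 0.01 * rng.normal(size=(3, 16))  # LR patches near Ref 3, 7, 9
idx, conf = match_reference_patches(lr, ref)
print(idx)  # recovers [3 7 9]
```

In practice, low-confidence matches are the hard case the abstract describes: when LR and Ref differ significantly, the similarity peak is weak and texture transfer must be gated accordingly.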
Domain adaptive semantic segmentation tends to generate high-quality pseudo labels for the target domain and retrain the segmentor on them. Under this self-training paradigm, some competitive methods have sought latent-space information, which establishes feature centroids (a.k.a. prototypes) of semantic classes and determines pseudo-label candidates by their distances to these centroids. In this paper, we argue that the latent space contains more information to be exploited, and thus take a further step to exploit it. First, instead of merely using source-domain prototypes to determine target pseudo labels, as most traditional methods do, we produce target-domain prototypes in a bidirectional manner to down-weight those source features that may be hard or unable to adapt. Second, existing attempts model each class as a single and isotropic prototype while ignoring the variance of the feature distribution, which may lead to confusion between similar classes. To address this issue, we propose to represent each class with multiple and anisotropic prototypes via a Gaussian Mixture Model, in order to fit the de facto distribution of the source domain and estimate the likelihood of target samples. We apply our method to the GTA5 → Cityscapes and SYNTHIA → Cityscapes tasks and achieve 61.2 and 62.8 mIoU respectively, which clearly outperforms other competitive self-training methods. Notably, on some classes that suffer from severe category confusion, such as "truck" and "bus", our method achieves 56.4 and 68.8 respectively, which further demonstrates the effectiveness of our design.
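The multi-prototype idea can be sketched as pseudo-labelling by per-class Gaussian-mixture likelihood; the mixtures below are hard-coded for illustration (in the paper they would be estimated from source features), and this is a sketch of the idea, not the exact estimator.

```python
import numpy as np

def gaussian_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var, axis=-1)

def gmm_pseudo_labels(target_feats, class_gmms):
    """Assign each target feature to the class whose Gaussian mixture
    (several anisotropic prototypes per class) gives it the highest
    likelihood."""
    scores = []
    for weights, means, vars_ in class_gmms:          # one GMM per class
        comp = np.stack([np.log(w) + gaussian_logpdf(target_feats, m, v)
                         for w, m, v in zip(weights, means, vars_)], axis=0)
        scores.append(np.logaddexp.reduce(comp, axis=0))  # log-sum-exp over components
    return np.argmax(np.stack(scores, axis=0), axis=0)

# two classes, two components each, 4-D features (all illustrative)
gmms = [([0.5, 0.5], [np.zeros(4), np.ones(4)], [np.ones(4)] * 2),
        ([0.7, 0.3], [5 * np.ones(4), 6 * np.ones(4)], [np.ones(4)] * 2)]
labels = gmm_pseudo_labels(np.vstack([np.zeros((3, 4)), 5 * np.ones((3, 4))]), gmms)
print(labels)  # [0 0 0 1 1 1]
```

Compared with a single isotropic centroid per class, the mixture assigns mass to several modes, which is exactly what helps separate visually similar classes such as truck and bus.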
Video restoration (e.g., video super-resolution) aims to restore high-quality frames from low-quality frames. Different from single-image restoration, video restoration generally requires utilizing temporal information from multiple adjacent but usually misaligned video frames. Existing deep methods generally tackle this by exploiting a sliding-window strategy or a recurrent architecture, which are either restricted to frame-by-frame restoration or lack long-range modelling ability. In this paper, we propose a Video Restoration Transformer (VRT) with parallel frame prediction and long-range temporal dependency modelling abilities. More specifically, VRT is composed of multiple scales, each of which consists of two kinds of modules: temporal mutual self attention (TMSA) and parallel warping. TMSA divides the video into small clips, on which mutual attention is applied for joint motion estimation, feature alignment and feature fusion, while self attention is used for feature extraction. To enable cross-clip interactions, the video sequence is shifted for every other layer. Besides, parallel warping is used to further fuse information from neighboring frames by parallel feature warping. Experimental results on five tasks, including video super-resolution, video deblurring, video denoising, video frame interpolation and space-time video super-resolution, demonstrate that VRT outperforms the state-of-the-art methods by large margins (up to 2.16 dB) on fourteen benchmark datasets.
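The clip-partition-with-alternating-shift mechanism described above can be sketched on a raw video tensor; the clip length and shapes are illustrative assumptions, not VRT's configuration.

```python
import numpy as np

def partition_clips(video, clip_len=2, shift=False):
    """Split a (T, H, W, C) video into non-overlapping temporal clips;
    rolling the sequence by half a clip before grouping (on alternating
    layers) lets neighbouring clips exchange information."""
    if shift:
        video = np.roll(video, clip_len // 2, axis=0)  # shift frames before grouping
    t = video.shape[0]
    assert t % clip_len == 0, "pad the sequence so T divides evenly"
    return video.reshape(t // clip_len, clip_len, *video.shape[1:])

video = np.arange(8 * 4 * 4 * 3).reshape(8, 4, 4, 3).astype(float)
clips = partition_clips(video)               # (4, 2, 4, 4, 3): attention runs per clip
shifted = partition_clips(video, shift=True)  # alternate layers see shifted clip borders
print(clips.shape, shifted.shape)
```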
One principal approach for illuminating a black-box neural network is feature attribution, i.e., identifying the importance of input features for the network's prediction. The predictive information of features was recently proposed as a proxy for the measure of their importance. So far, the predictive information has only been identified for latent features, by placing an information bottleneck within the network. We propose a method to identify features with predictive information in the input domain. The method results in fine-grained identification of input features' information and is agnostic to the network architecture. The core idea of our method is leveraging a bottleneck on the input that only lets input features associated with predictive latent features pass through. We compare our method with several feature attribution methods using mainstream feature attribution evaluation experiments. The code is publicly available.
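The input-bottleneck idea can be sketched as a soft mask that passes selected input regions and replaces the rest with noise; optimizing such a mask so the prediction survives is what identifies the predictive input features. This is a generic sketch of an input-space bottleneck, not a specific library API.

```python
import numpy as np

def input_bottleneck(x, mask, noise_sigma=1.0, rng=None):
    """Pass through only the parts of the input selected by `mask`
    (values in [0, 1]); the rest is replaced by Gaussian noise, which
    destroys any information the mask suppresses."""
    rng = rng or np.random.default_rng()
    noise = rng.normal(0.0, noise_sigma, size=x.shape)
    return mask * x + (1.0 - mask) * noise

rng = np.random.default_rng(3)
x = np.ones((4, 4))
mask = np.zeros((4, 4))
mask[:2] = 1.0                         # keep only the top half of the input
out = input_bottleneck(x, mask, rng=rng)
print(out[:2])  # top half passes through unchanged
```

In an attribution setting, the mask itself would be optimized against the network's prediction loss plus a sparsity/information penalty; the final mask is the attribution map.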
Unsupervised domain adaptive person re-identification has received significant attention due to its high practical value. In the past few years, by following the clustering and fine-tuning paradigm, researchers have proposed to utilize the teacher-student framework to reduce the domain gap between different person re-identification datasets. Inspired by recent teacher-student framework based methods, which try to mimic the human learning process either by making the student directly copy behaviors from the teacher or by selecting reliable learning materials, we propose to conduct further exploration to imitate the human learning process from different aspects, i.e., adaptively updating the learning materials, selectively imitating the teacher's behaviors, and analyzing the structure of the learning materials. The three explored components cooperate together to constitute a new method for unsupervised domain adaptive person re-identification, called the Human Learning Imitation framework. Experimental results on three benchmark datasets demonstrate the efficacy of our proposed method.
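For context, the teacher in such teacher-student frameworks is commonly maintained as an exponential moving average (EMA) of the student; this is a generic sketch of that standard update rule, not this paper's exact procedure.

```python
def ema_update(teacher_params, student_params, momentum=0.999):
    """Update the teacher as an exponential moving average of the
    student, yielding a slowly-evolving, more stable teacher whose
    outputs supervise the student."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

teacher = [0.0, 10.0]
student = [1.0, 0.0]
teacher = ema_update(teacher, student, momentum=0.9)
print(teacher)  # [0.1, 9.0]
```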
Inferring the stereo structure of objects in the real world is a challenging yet practical task. To equip deep models with this ability, large amounts of 3D supervision are usually required, which are hard to acquire. It is promising that we can simply benefit from synthetic data, where pairwise ground truth is easy to access. Nevertheless, the domain gap is nontrivial considering the variant textures, shapes and contexts. To overcome these difficulties, we propose an adaptive network for single-view 3D reconstruction, called VPAN. To generalize the model towards a real scenario, we propose to fulfill several aspects: (1) Appearance: visually incorporate spatial structure from the single view to enhance the expressiveness of the representation; (2) Cast: perceptually align the 2D image features to the 3D shape priors with cross-modal semantic contrastive mapping; (3) Mold: reconstruct the stereo shape of the target by transforming the embedding into the desired manifold. Extensive experiments on several benchmarks demonstrate the effectiveness and robustness of the proposed method in learning the 3D shape manifold from synthetic data via a single view. The proposed method outperforms state-of-the-art methods on the Pix3D dataset with IoU 0.292 and CD 0.108, and reaches IoU 0.329 and CD 0.104 on Pascal 3D+.
We consider the problem of unsupervised domain adaptation in semantic segmentation. A key in this campaign consists in reducing the domain shift, i.e., enforcing the data distributions of the two domains to be similar. One of the common strategies is to align the marginal distribution in the feature space through adversarial learning. However, this global alignment strategy does not consider the category-level joint distribution. A possible consequence of such global movement is that some categories which are originally well aligned between the source and target may be incorrectly mapped, thus leading to worse segmentation results in the target domain. To address this problem, we introduce a category-level adversarial network, aiming to enforce local semantic consistency during the trend of global alignment. Our idea is to take a close look at the category-level joint distribution and align each class with an adaptive adversarial loss. Specifically, we reduce the weight of the adversarial loss for category-level aligned features while increasing the adversarial force for those poorly aligned. In this process, we decide how well a feature is category-level aligned between source and target by a co-training approach. In two domain adaptation tasks, i.e., GTA5 → Cityscapes and SYNTHIA → Cityscapes, we validate that the proposed method matches the state of the art in segmentation accuracy.
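The adaptive weighting scheme described above can be sketched by using the disagreement of two co-trained classifiers as the alignment signal; the L1 discrepancy used here is one plausible choice for illustration, not necessarily the paper's exact formulation.

```python
import numpy as np

def adaptive_adversarial_weights(probs_a, probs_b, base_weight=1.0):
    """Down-weight the adversarial loss where two co-trained classifiers
    agree (category-level aligned features) and up-weight it where they
    disagree (poorly aligned features)."""
    discrepancy = np.abs(probs_a - probs_b).sum(axis=-1)  # per-sample disagreement
    return base_weight * discrepancy                      # per-sample loss weight

# when the two classifiers agree, the adversarial force stays small;
# when they disagree, it grows (values are illustrative)
w_agree = adaptive_adversarial_weights(np.array([[0.9, 0.1]]),
                                       np.array([[0.88, 0.12]]))
w_diff = adaptive_adversarial_weights(np.array([[0.9, 0.1]]),
                                      np.array([[0.2, 0.8]]))
print(w_agree, w_diff)  # small weight vs. large weight
```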
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes the image and point-cloud tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is quite simple while its performance is impressive. CMT obtains 73.0% NDS on the nuScenes benchmark. Moreover, CMT remains strongly robust even if the LiDAR is missing. Code will be released at https://github.com/junjie18/CMT.
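The token-level fusion described above can be sketched at the shape level: image and point-cloud tokens are each combined with a position embedding and concatenated into one sequence for a transformer decoder. This is an illustrative sketch of the input layout, not the CMT implementation.

```python
import numpy as np

def build_multimodal_tokens(img_feats, pts_feats, img_pos, pts_pos):
    """Concatenate image and point-cloud tokens, each summed with a
    position embedding derived from 3D coordinates, so a decoder can
    attend to both modalities without explicit view transformation."""
    return np.concatenate([img_feats + img_pos, pts_feats + pts_pos], axis=0)

rng = np.random.default_rng(4)
tokens = build_multimodal_tokens(
    img_feats=rng.normal(size=(100, 32)),  # e.g., 100 image tokens, dim 32
    pts_feats=rng.normal(size=(60, 32)),   # e.g., 60 point-cloud tokens
    img_pos=rng.normal(size=(100, 32)),    # 3D-derived position embeddings
    pts_pos=rng.normal(size=(60, 32)))
print(tokens.shape)  # (160, 32): one joint sequence for the decoder
```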